How Diverse Are Diversity Assessment Methods? A Comparative Analysis and Benchmarking of Molecular Descriptor Space

نویسندگان

  • Alexios Koutsoukas
  • Shardul Paricharak
  • Warren R. J. D. Galloway
  • David R. Spring
  • Adriaan P. IJzerman
  • Robert C. Glen
  • David Marcus
  • Andreas Bender
چکیده

Chemical diversity is a widely applied approach to select structurally diverse subsets of molecules, often with the objective of maximizing the number of hits in biological screening. While many methods exist in the area, few systematic comparisons using current descriptors in particular with the objective of assessing diversity in bioactivity space have been published, and this shortage is what the current study is aiming to address. In this work, 13 widely used molecular descriptors were compared, including fingerprint-based descriptors (ECFP4, FCFP4, MACCS keys), pharmacophore-based descriptors (TAT, TAD, TGT, TGD, GpiDAPH3), shape-based descriptors (rapid overlay of chemical structures (ROCS) and principal moments of inertia (PMI)), a connectivity-matrix-based descriptor (BCUT), physicochemical-property-based descriptors (prop2D), and a more recently introduced molecular descriptor type (namely, "Bayes Affinity Fingerprints"). We assessed both the similar behavior of the descriptors in assessing the diversity of chemical libraries, and their ability to select compounds from libraries that are diverse in bioactivity space, which is a property of much practical relevance in screening library design. This is particularly evident, given that many future targets to be screened are not known in advance, but that the library should still maximize the likelihood of containing bioactive matter also for future screening campaigns. Overall, our results showed that descriptors based on atom topology (i.e., fingerprint-based descriptors and pharmacophore-based descriptors) correlate well in rank-ordering compounds, both within and between descriptor types. On the other hand, shape-based descriptors such as ROCS and PMI showed weak correlation with the other descriptors utilized in this study, demonstrating significantly different behavior. We then applied eight of the molecular descriptors compared in this study to sample a diverse subset of sample compounds (4%) from an initial population of 2587 compounds, covering the 25 largest human activity classes from ChEMBL and measured the coverage of activity classes by the subsets. Here, it was found that "Bayes Affinity Fingerprints" achieved an average coverage of 92% of activity classes. Using the descriptors ECFP4, GpiDAPH3, TGT, and random sampling, 91%, 84%, 84%, and 84% of the activity classes were represented in the selected compounds respectively, followed by BCUT, prop2D, MACCS, and PMI (in order of decreasing performance). In addition, we were able to show that there is no visible correlation between compound diversity in PMI space and in bioactivity space, despite frequent utilization of PMI plots to this end. To summarize, in this work, we assessed which descriptors select compounds with high coverage of bioactivity space, and can hence be used for diverse compound selection for biological screening. In cases where multiple descriptors are to be used for diversity selection, this work describes which descriptors behave complementarily, and can hence be used jointly to focus on different aspects of diversity in chemical space.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Benchmarking of protein descriptor sets in proteochemometric modeling (part 1): comparative study of 13 amino acid descriptor sets

BACKGROUND While a large body of work exists on comparing and benchmarking of descriptors of molecular structures, a similar comparison of protein descriptor sets is lacking. Hence, in the current work a total of 13 different protein descriptor sets have been compared with respect to their behavior in perceiving similarities between amino acids. The descriptor sets included in the study are Z-s...

متن کامل

Wheat and barley seed system in Syria: How diverse are wheat and barley varieties and landraces from farmer’s fields?

"> The present study described the diversity of wheat and barley varieties andlandraces available in farmer’s fields in Syria using different indicators. Analysisof spatial and temporal diversity and coefficient of parentage along withmeasurements of agronomic and morphological traits were employed to explain thediversity of wheat and barley varieties or landraces grown by farmers in Syria.Farm...

متن کامل

The congruence between matrilineal genetic (mtDNA) and geographic diversity of Iranians and the territorial populations

Objective(s):From the ancient era, emergence of Agriculture in the connecting region of Mesopotamia and the Iranian plateau at the foothills of the Zagros Mountains, made Iranian gene pool as an important source of populating the region. It has differentiated the population spread and different language groups. In order to trace the maternal genetic affinity between Iranians and other populatio...

متن کامل

A comparative analysis on the results of different methods in assessment land use mix in Mashhad

In recent years, land use mix has become an important issue of ​​research in urban studies. One of the important issues related to land use mix is how to measure it in urban areas. Given that many studies have assumed land use mix as an independent variable affecting other urban life dimensions, Therefore, how it is measured is crucial for achieving reliable results. The purpose of this researc...

متن کامل

A Robust Missing Value Imputation Method MifImpute For Incomplete Molecular Descriptor Data And Comparative Analysis With Other Missing Value Imputation Methods

Missing data imputation is an important research topic in data mining. Large-scale Molecular descriptor data may contains missing values (MVs). However, some methods for downstream analyses, including some prediction tools, require a complete descriptor data matrix. We propose and evaluate an iterative imputation method MiFoImpute based on a random forest. By averaging over many unpruned regres...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Journal of chemical information and modeling

دوره 54 1  شماره 

صفحات  -

تاریخ انتشار 2014